yufan_yin_week0: 9.9. - 15.9.2020
See also my course diary: https://yufanyin.github.io/datavis-R/
and the repository: https://github.com/yufanyin/datavis-R
Structure of the data
learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019.csv", stringsAsFactors = TRUE)
str(learning2019)
## 'data.frame': 218 obs. of 17 variables:
## $ 锘縞luster : int 3 2 1 1 3 1 2 2 1 3 ...
## $ unref : num 4 2 3 2 3 ...
## $ deep : num 3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
## $ orga : num 3.33 3 4.33 3.67 2.67 ...
## $ blocks : num 3.33 3.67 3.67 3 3.67 ...
## $ procrastination: num 3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
## $ perfectionism : num 3.67 3.33 3.33 2.67 2.33 ...
## $ innateability : num 1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
## $ ktransforming : num 4 3.67 3.67 3.33 4 ...
## $ productivity : num 1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
## $ gender : int 2 2 2 2 2 2 2 2 2 2 ...
## $ studentstatus : int 1 1 1 1 1 1 1 1 1 3 ...
## $ studylength : int 39 51 3 3 15 3 3 3 3 3 ...
## $ writingcourse : int 2 3 4 0 0 11 0 0 44 35 ...
## $ monthsamel : int 2 2 NA 0 NA 2 4 NA 3 2 ...
## $ no : int 1 2 3 4 5 6 7 8 9 10 ...
## $ faculty : int 2 8 5 9 2 6 4 4 4 9 ...
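The first column prints as 锘縞luster rather than cluster. Stray characters in front of the first column name are typical of a CSV file saved with a UTF-8 byte-order mark (BOM). The sketch below assumes that this is the cause here. It writes a small temporary file with a BOM (the file and its two columns are made up for illustration) and reads it back with fileEncoding = "UTF-8-BOM".

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("cluster,unref\n3,4\n2,2\n"), con)
close(con)

# Telling read.csv() about the BOM strips it from the first column name.
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

If the file cannot be read again, renaming the first column, e.g. names(learning2019)[1] <- "cluster", should have the same effect.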
The aim of the study is to investigate the interrelationships between approaches to learning and conceptions of academic writing among international university students. Altogether 218 international students of the university participated in the study in 2018 and 2019. Students were divided into homogeneous groups (profiles) based on their Z scores on the three approaches to learning. We then compared mean differences between the profiles using ANOVA.
The data ‘learning2019’ consists of 218 observations and 17 variables. It contains the students’ scores on approaches to learning (different ways in which students process information: unreflective studying, deep approach to learning and organised studying), conceptions of academic writing (blocks, procrastination, perfectionism, innate ability, knowledge transforming and productivity), and some background information (categorical variables, e.g. gender, age, faculty, student status and study length).
Some of the columns are explained below. Each score is the mean of 2-4 items rated on a 5-point Likert scale (1 = totally disagree, 5 = fully agree).
“unref”: relying on memorisation in the learning process, lacking the reflective approach to studying and applying the fragmented knowledge base.
“deep”: comprehending the intentional content, using evidence and integrating with previous knowledge.
“orga”: time management, study organisation, effort management and concentration.
“blocks”: the inability to write productively for reasons other than intellectual capacity or literary skills.
“procrastination”: failing to start or postponing tasks like preparing for exams and doing homework.
“perfectionism”: setting overly high standards, pursuing flawlessness, and evaluating one’s behavior critically.
“innateability”: writing is a skill which “is determined at birth” or “cannot be taught or developed”.
“ktransforming”: (knowledge transforming) using writing to develop knowledge and generate new ideas, and as part of reflective and dialectic processes.
“productivity”: (sense of productivity) part of self-efficacy in writing.
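As an illustration of how such a scale score is formed, the sketch below averages three Likert items per student. The item names and answers are hypothetical and are not taken from the questionnaire.

```r
# Hypothetical answers of three students to three Likert items (1-5).
items <- data.frame(
  deep_item1 = c(4, 5, 3),
  deep_item2 = c(3, 4, 4),
  deep_item3 = c(4, 4, 2)
)

# The scale score is the row-wise mean of the items.
deep <- rowMeans(items)
round(deep, 2)
```

A mean of three items gives values such as 3.67 or 4.33, which is the pattern visible in the columns above.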
I understand the basics of data wrangling.
I have learned to use R for analyses such as clustering and classification, though not very proficiently.
This is because I attended the course “Introduction to Open Data Science” (HYMY-909, 5 cr) last autumn. Here is the link to my GitHub repository:
https://github.com/yufanyin/IODS-project
and my course diary:
To learn practical data visualization skills using R and the ggplot2 library. I know little about data visualization.
To learn about good data visualization and to avoid bad or incorrect practice.
To produce rich, accurate and concise visualizations using my own data. I have found a suitable method for analysing my own data and carried out the analysis in SPSS. Attending this course can help me produce better visualizations, which will benefit me a lot when I submit my FIRST article at the end of this year.
yufan_yin_week1: 16.9. - 21.9.2020
See also my course diary: https://yufanyin.github.io/datavis-R/
(This is a habit from another course.)
# Create a vector named my_vector. It should have 7 numeric elements.
my_vector <- c(20, 14, 18, 14, 10, 16, 16)
# Print your vector
my_vector
## [1] 20 14 18 14 10 16 16
# Calculate the minimum, maximum, and median values of your vector
summary(my_vector)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 14.00 16.00 15.43 17.00 20.00
# Print "The median value is XX"
median_exercise1 <- median(my_vector) # Output from functions can be saved to objects
paste("The median value is", median_exercise1) # Use the paste() function to print the object with text
## [1] "The median value is 16"
# Create another vector named my_vector_2. It should have the elements of my_vector divided by 2.
my_vector_2 <- my_vector/2 # Arithmetic on a vector is applied to every element
my_vector_2
## [1] 10 7 9 7 5 8 8
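Dividing a vector and accessing its elements by index are two different operations. A minimal sketch of both, reusing the values of my_vector (the names half, third_element and first_two are only for illustration):

```r
my_vector <- c(20, 14, 18, 14, 10, 16, 16)

half <- my_vector / 2          # element-wise: every element is divided by 2
third_element <- my_vector[3]  # indexing: the third element
first_two <- my_vector[1:2]    # indexing with a range: the first two elements
```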
# Create a vector named my_words. It should have 7 character elements.
my_words <- c("swan", "goose", "mallard", "blue_tit", "philomelos", "sparrow", "gull")
# Combine my_vector and my_words into a data frame.
df <- data.frame(my_vector, my_words)
df
## my_vector my_words
## 1 20 swan
## 2 14 goose
## 3 18 mallard
## 4 14 blue_tit
## 5 10 philomelos
## 6 16 sparrow
## 7 16 gull
# Show the structure of the data frame.
str(df)
## 'data.frame': 7 obs. of 2 variables:
## $ my_vector: num 20 14 18 14 10 16 16
## $ my_words : chr "swan" "goose" "mallard" "blue_tit" ...
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.2 √ purrr 0.3.4
## √ tibble 3.0.3 √ dplyr 1.0.2
## √ tidyr 1.1.2 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Use the head() function to print the first 3 rows of your data frame.
head(df, 3) # the second argument n sets the number of rows; the default is 6
## my_vector my_words
## 1 20 swan
## 2 14 goose
## 3 18 mallard
# Create a new variable to the data frame which has the values of my_vector_2 (remember to save the new variable to the data frame object).
pair <- c(my_vector_2)
pair
## [1] 10 7 9 7 5 8 8
df2 <- data.frame(df,pair)
df2
## my_vector my_words pair
## 1 20 swan 10
## 2 14 goose 7
## 3 18 mallard 9
## 4 14 blue_tit 7
## 5 10 philomelos 5
## 6 16 sparrow 8
## 7 16 gull 8
# Use filter() to print rows of your data frame greater than the median value of my_vector.
df2 %>% filter(my_vector > median(my_vector)) # compare the column, not the whole data frame
## my_vector my_words pair
## 1 20 swan 10
## 2 18 mallard 9
yufan_yin_week2: 23.9. - 28.9.2020
Create a new code chunk where you load the tidyverse package. In the chunk settings, suppress any output messages.
The tibble df has 60 observations (rows) of variables (columns) group, gender, age, score1 and score2 (continuous scores from two tests). Each row represents one participant.
df
## # A tibble: 60 x 4
## group gender score1 score2
## <int> <chr> <dbl> <chr>
## 1 2 F 18.7 14.7563711082321
## 2 1 M 20.1 15.1463059324341
## 3 2 F 17.4 19.0025387614538
## 4 1 M 18.7 15.5693261509451
## 5 2 F 18.5 16.7322250273729
## 6 1 999 16.9 16.4511010915052
## 7 2 M 20.4 15.1008590050657
## 8 1 F 20.3 15.191041952879
## 9 1 F 19.4 13.9717194882152
## 10 2 M 21.2 22.6918520246433
## # ... with 50 more rows
There is something to fix in three of the variables. Explore the data and describe what needs to be corrected.
Hint: You can use e.g. str(), distinct(), and summary() to explore the data.
str(df)
## tibble [60 x 4] (S3: tbl_df/tbl/data.frame)
## $ group : int [1:60] 2 1 2 1 2 1 2 1 1 2 ...
## $ gender: chr [1:60] "F" "M" "F" "M" ...
## $ score1: num [1:60] 18.7 20.1 17.4 18.7 18.5 ...
## $ score2: chr [1:60] "14.7563711082321" "15.1463059324341" "19.0025387614538" "15.5693261509451" ...
summary(df)
## group gender score1 score2
## Min. :1.0 Length:60 Min. :14.17 Length:60
## 1st Qu.:1.0 Class :character 1st Qu.:16.85 Class :character
## Median :1.5 Mode :character Median :17.61 Mode :character
## Mean :1.5 Mean :17.89
## 3rd Qu.:2.0 3rd Qu.:19.01
## Max. :2.0 Max. :21.53
distinct(df)
## # A tibble: 60 x 4
## group gender score1 score2
## <int> <chr> <dbl> <chr>
## 1 2 F 18.7 14.7563711082321
## 2 1 M 20.1 15.1463059324341
## 3 2 F 17.4 19.0025387614538
## 4 1 M 18.7 15.5693261509451
## 5 2 F 18.5 16.7322250273729
## 6 1 999 16.9 16.4511010915052
## 7 2 M 20.4 15.1008590050657
## 8 1 F 20.3 15.191041952879
## 9 1 F 19.4 13.9717194882152
## 10 2 M 21.2 22.6918520246433
## # ... with 50 more rows
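The dataset df consists of 60 observations and 4 variables: group, gender, score1 and score2 (age is not in the printed tibble). Three things need to be corrected: gender contains the code 999, which stands for a missing value; score2 is stored as character instead of numeric; and group is stored as an integer although it is probably meant to be a categorical variable.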
Make the corrections you described above.
df <- df %>%
mutate(gender = na_if(gender, "999")) # recode "999" to NA (missing); gender is a character column, so compare with a string
df$score2 <- as.numeric(df$score2) # convert a character vector to a numeric vector
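The same corrections can be written as one mutate() call. The sketch below uses a small made-up tibble (toy) that imitates the problems seen in df. Converting group to a factor is an assumption about what the exercise intends.

```r
library(dplyr)

# Toy data imitating the problems seen above (values are made up).
toy <- tibble(
  group  = c(1L, 2L, 1L),
  gender = c("F", "999", "M"),
  score2 = c("14.7", "15.1", "19.0")
)

toy_clean <- toy %>%
  mutate(
    gender = na_if(gender, "999"),  # "999" becomes NA
    score2 = as.numeric(score2),    # character to numeric
    group  = factor(group)          # assumption: group is categorical
  )
str(toy_clean)
```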
Count observations by group and gender. Arrange by the number of observations (ascending).
df %>%
count(group, gender) %>% # count() is a combination of group_by() and tally()
arrange(desc(n)) %>% # descending order of n; arrange(n) would sort in ascending order
arrange(group) # then sort by group
## # A tibble: 6 x 3
## group gender n
## <int> <chr> <int>
## 1 1 M 14
## 2 1 F 13
## 3 1 <NA> 3
## 4 2 F 15
## 5 2 M 14
## 6 2 <NA> 1
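The exercise asks for ascending order, which arrange(n) gives directly. A minimal sketch on made-up data (toy and counts are illustrative names):

```r
library(dplyr)

toy <- tibble(
  group  = c(1, 1, 1, 2, 2, 2),
  gender = c("F", "F", "M", "F", "M", "M")
)

counts <- toy %>%
  count(group, gender) %>%  # one row per combination, with the count in n
  arrange(n)                # smallest counts first
counts
```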
Create a new variable, score_diff, that contains the difference between score1 and score2.
df$score_diff <- df$score1 - df$score2
Compute the means of score1, score2, and score_diff.
Hint: Like mutate(), summarise() can take multiple variables in one go.
df %>%
summarise(score1_mean = mean(score1), score2_mean = mean(score2), score_diff_mean = mean(score_diff))
## # A tibble: 1 x 3
## score1_mean score2_mean score_diff_mean
## <dbl> <dbl> <dbl>
## 1 17.9 16.1 1.82
Compute the means of score1, score2, and score_diff by gender.
grouped_df <- df %>%
group_by(gender)
grouped_df %>%
summarise(score1_mean = mean(score1), score2_mean = mean(score2), score_diff_mean = mean(score_diff))
## # A tibble: 3 x 4
## gender score1_mean score2_mean score_diff_mean
## <chr> <dbl> <dbl> <dbl>
## 1 F 17.9 16.3 1.63
## 2 M 18.1 16.0 2.08
## 3 <NA> 16.4 15.0 1.34
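When the same function is applied to several columns, across() (available in dplyr 1.0.0 and later) saves repeating the column names. A sketch on made-up scores, not on the course data:

```r
library(dplyr)

toy <- tibble(
  gender = c("F", "F", "M", "M"),
  score1 = c(18, 20, 17, 19),
  score2 = c(15, 16, 14, 18)
) %>%
  mutate(score_diff = score1 - score2)

means_by_gender <- toy %>%
  group_by(gender) %>%
  summarise(across(c(score1, score2, score_diff), mean))  # one mean per column and group
means_by_gender
```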
Using ggplot2, create a scatter plot with score1 on the x-axis and score2 on the y-axis.
df %>%
ggplot(aes(score1, score2)) + # x = score1, y = score2
geom_point()
Continuing with the previous plot, colour the points based on gender.
Set the output figure width to 10 and height to 6.
df %>%
ggplot(aes(score1, score2, color = gender)) + # x = score1, y = score2
geom_point()
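In an R Markdown document the figure size is normally set in the chunk header, for example {r, fig.width = 10, fig.height = 6}. As a sketch of the same idea outside knitting, ggsave() also takes a width and a height in inches. The toy data and the file name below are assumptions for illustration.

```r
library(ggplot2)

toy <- data.frame(
  score1 = c(18, 20, 17, 19),
  score2 = c(15, 16, 14, 18),
  gender = c("F", "M", "F", "M")
)

p <- ggplot(toy, aes(score1, score2, color = gender)) +
  geom_point()

# Save the plot 10 inches wide and 6 inches high.
out_file <- file.path(tempdir(), "scores.pdf")
ggsave(out_file, plot = p, width = 10, height = 6)
```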
Note: I did this part in another rmd file named ‘index’.
see: https://github.com/yufanyin/datavis-R/blob/master/index.Rmd
Add the author (your name) and date into the metadata section. Create a table of contents.
Knit your document to HTML by changing html_notebook to html_document in the metadata, and pressing Knit.
See the results in my course diary: https://yufanyin.github.io/datavis-R/
yufan_yin_week3: 29.9. - 5.10.2020
library(tidyverse)
Read the data into R. It has 206 observations of 17 variables.
learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_week3.csv", stringsAsFactors = TRUE)
learning19 <- learning2019 %>%
mutate(studylength = as.numeric(studylength),
writingcourse = as.numeric(writingcourse))
str(learning19)
## 'data.frame': 206 obs. of 17 variables:
## $ 锘縩o : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cluster : int 3 2 1 1 3 1 2 2 1 3 ...
## $ unref : num 4 2 3 2 3 2.67 1 2.33 3 3.67 ...
## $ deep : num 3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
## $ orga : num 3.33 3 4.33 3.67 2.67 4 2.33 3.33 4 3.67 ...
## $ blocks : num 3.33 3.67 3.67 3 3.67 4 2.67 2.33 3.33 2.67 ...
## $ procrastination: num 3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
## $ perfectionism : num 3.67 3.33 3.33 2.67 2.33 2.33 4 2.67 3 3.33 ...
## $ innateability : num 1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
## $ ktransforming : num 4 3.67 3.67 3.33 4 3.33 2 4.33 4 4.33 ...
## $ productivity : num 1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
## $ gender : int 2 2 2 2 2 2 2 2 2 2 ...
## $ studentstatus : int 1 1 1 1 1 1 1 1 1 2 ...
## $ studylength : num 39 51 3 3 15 3 3 3 3 3 ...
## $ writingcourse : num 2 3 4 0 0 11 0 0 44 35 ...
## $ monthsamel : int 2 2 NA 0 NA 2 4 NA 3 2 ...
## $ faculty : int 2 8 5 9 2 6 4 4 4 9 ...
For my data, studylength is more suitable as the categorical variable than age. It describes how many months students have studied at the university.
Cut the continuous variable studylength into a categorical variable studylength_group. Use ggplot2’s cutting function: cut_number() makes n groups with (approximately) equal number of observations.
Count observations by studylength group.
library(ggplot2)
learning19 %>%
mutate(score_group_test = cut_width(studylength, 12, boundary = 0)) %>% # groups of width 12 months, starting from 0
count(score_group_test)
## score_group_test n
## 1 [0,12] 102
## 2 (12,24] 47
## 3 (24,36] 19
## 4 (36,48] 14
## 5 (48,60] 16
## 6 (60,72] 5
## 7 (72,84] 2
## 8 (168,180] 1
library(ggplot2)
learning19 %>%
mutate(studylength_group = cut_number(studylength, 3)) %>% # each group has about 206 / 3 = 68 observations
count(studylength_group)
## studylength_group n
## 1 [2,7] 71
## 2 (7,17] 67
## 3 (17,172] 68
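The difference between the two cutting functions is easier to see on a short vector. In the sketch below, months is a made-up vector of study lengths: cut_width() fixes the width of each group, while cut_number() fixes (approximately) the number of observations per group.

```r
library(ggplot2)

months <- c(2, 3, 3, 5, 8, 14, 15, 30, 51)

by_width  <- cut_width(months, width = 12, boundary = 0)  # fixed 12-month groups
by_number <- cut_number(months, n = 3)                    # 3 groups of about equal size

table(by_width)
table(by_number)
```

With skewed data such as studylength, fixed-width groups become very uneven, which is why cut_number() gives more balanced groups here.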
Save the results with labels to the data.
learning19 <- learning19 %>%
mutate(studylength_group = cut_number(studylength, 3,
labels = c('-7','8-17','18-')))
learning19 %>%
distinct(studylength_group)
## studylength_group
## 1 18-
## 2 -7
## 3 8-17
The chunk below is supposed to produce a plot but it has some errors.
The figure should be a scatter plot of cluster (different student profiles) on the x-axis and blocks on the y-axis, with points coloured by studylength_group (3 levels). It should also have three linear regression lines, one for each of the study-length groups.
Fix the code to produce the right figure.
What happens if you use geom_jitter() instead of geom_point()?
Hint: Examine the code bit by bit: start by plotting just the scatter plot without geom_smooth(), and add the regression lines last.
learning19 %>%
ggplot(aes(cluster, blocks, color = studylength_group)) + # color (not fill) applies to points and lines
geom_point() + # a scatter plot needs geom_point(), not geom_col()
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
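On the geom_jitter() question: it adds a small random offset to each point, so points that share the same cluster value no longer lie on top of each other. A sketch on simulated data (toy is made up, and the amount of jitter is an arbitrary choice):

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(
  cluster = rep(1:3, each = 10),
  blocks = sample(seq(1, 5, by = 1/3), 30, replace = TRUE),
  studylength_group = rep(c("-7", "8-17", "18-"), times = 10)
)

p <- ggplot(toy, aes(cluster, blocks, color = studylength_group)) +
  geom_jitter(width = 0.15, height = 0) +                  # horizontal noise only
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # one line per colour group
p
```

Setting height = 0 keeps the blocks scores themselves unchanged and spreads the points only sideways.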
learning19 %>%
ggplot(aes(cluster, blocks)) +
geom_col() +
facet_wrap(~studylength_group)
Calculate the mean, standard deviation (sd), and number of observations (n) of score on blocks by student profiles and study-length group. Also calculate the standard error of the mean (by using sd and n). Save these into a new data frame (or tibble) named cluster_blocks_stats.
cluster_blocks_stats <- learning19 %>%
group_by(cluster, studylength_group, .drop = FALSE) %>% # .drop = FALSE keeps combinations that have no observations
summarise(mean_blocks = mean(blocks),
sd_blocks = sd(blocks),
n = n()) %>%
ungroup()
## `summarise()` regrouping output by 'cluster' (override with `.groups` argument)
cluster_blocks_stats
## # A tibble: 9 x 5
## cluster studylength_group mean_blocks sd_blocks n
## <int> <fct> <dbl> <dbl> <int>
## 1 1 -7 2.48 0.981 31
## 2 1 8-17 2.49 0.870 37
## 3 1 18- 2.38 0.685 26
## 4 2 -7 2.85 0.922 27
## 5 2 8-17 2.53 0.775 22
## 6 2 18- 2.59 0.936 27
## 7 3 -7 3.44 0.906 13
## 8 3 8-17 2.88 1.15 8
## 9 3 18- 3.04 0.845 15
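The standard error of the mean asked for in the exercise is sd divided by the square root of n. A minimal sketch using three rows of the table above (the "-7" group of each cluster); the column name se_blocks is an illustrative choice:

```r
library(dplyr)

stats <- tibble(
  cluster   = c(1, 2, 3),
  sd_blocks = c(0.981, 0.922, 0.906),
  n         = c(31, 27, 13)
)

stats <- stats %>%
  mutate(se_blocks = sd_blocks / sqrt(n))  # standard error of the mean
stats
```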
Using cluster_blocks_stats, plot a bar plot that has cluster on the x-axis, mean score of blocks on the y-axis, and studylength levels in subplots (facets).
Use geom_errorbar() to add error bars that represent standard errors of the mean.
cluster_blocks_stats %>%
mutate(se_blocks = sd_blocks / sqrt(n)) %>% # standard error of the mean
ggplot(aes(cluster, mean_blocks)) +
geom_col() +
geom_errorbar(aes(ymin = mean_blocks - se_blocks, ymax = mean_blocks + se_blocks), width = .3) +
facet_wrap(~studylength_group)
Create a figure that has boxplots of cluster (x-axis) by blocks (y-axis).
Note: ‘Ord.factor’ means an ordered factor, i.e. a categorical variable whose levels have a fixed order. cluster is stored as an integer, so it is converted with factor() inside aes(); otherwise ggplot2 warns ‘Continuous x aesthetic -- did you forget aes(group=...)?’.
learning19 %>%
ggplot(aes(factor(cluster), blocks)) +
geom_boxplot() +
facet_wrap(~studylength_group)
Group the data by cluster and add mean score of blocks by cluster to a new column mean_score. Do this with mutate() (not summarise()).
Reorder the levels of cluster based on mean_score.
Hint: Remember to ungroup after creating the mean_score variable.
Note: Maybe the types of the variables in my data are not suitable for these operations.
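The exercise can be carried out once cluster is converted to a factor. The sketch below uses a small made-up data frame (toy) and fct_reorder() from forcats; it shows the pattern only and is not a result from the course data.

```r
library(dplyr)
library(forcats)

toy <- tibble(
  cluster = c(1, 1, 2, 2, 3, 3),
  blocks  = c(2.0, 3.0, 3.5, 4.5, 1.0, 2.0)
)

toy_ordered <- toy %>%
  group_by(cluster) %>%
  mutate(mean_score = mean(blocks)) %>%  # group mean repeated on every row
  ungroup() %>%
  mutate(cluster = fct_reorder(factor(cluster), mean_score))  # levels ordered by mean_score

levels(toy_ordered$cluster)
```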
Using the data you modified in exercise 4.2, plot mean scores (x-axis) by cluster (y-axis) as points. The clusters should be ordered by mean score.
Use stat_summary() to add error bars that represent standard errors of the mean.
Hint: Be careful which variable - mean_score or score - you’re plotting in each of the geoms.
Note: Maybe the variables in my data are not suitable for such an operation.
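A possible way to draw such a figure is sketched below on made-up data. It assumes that cluster is a factor, lets stat_summary() compute the means and standard errors from the raw scores, and uses coord_flip() to put the scores on the x-axis.

```r
library(ggplot2)

toy <- data.frame(
  cluster = factor(rep(c("1", "2", "3"), each = 4)),
  blocks  = c(2, 3, 2.5, 3.5,  3, 4, 4.5, 3.5,  1, 2, 1.5, 2.5)
)

p <- ggplot(toy, aes(reorder(cluster, blocks), blocks)) +          # clusters ordered by mean score
  stat_summary(fun = mean, geom = "point") +                       # mean as a point
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +  # mean +/- 1 SE
  coord_flip() +                                                   # scores on the x-axis
  labs(x = "cluster", y = "mean score of blocks")
p
```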
yufan_yin_week4: 6.10. - 12.10.2020
Read the region_scores.csv data
region_scores <- read.csv(file = "D:/Users/yinyf/datavis-R/week4/region_scores.csv", stringsAsFactors = TRUE)
region_scores <- region_scores %>%
mutate(id = as.character(id),
region = factor(region),
education = factor(education, ordered = TRUE),
gender = factor(gender))
glimpse(region_scores)
## Rows: 240
## Columns: 6
## $ id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", ...
## $ region <fct> South Karelia, Satakunta, Kymenlaakso, South Karelia, Sou...
## $ education <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ gender <fct> M, F, M, F, F, F, M, F, M, M, F, M, F, M, M, M, F, M, F, ...
## $ age <int> 56, 41, 48, 41, 35, 60, 28, 28, 48, 51, 45, 55, 41, 24, 6...
## $ score <dbl> 4.268811, 5.646586, 6.949019, 7.096777, 6.990985, 5.26766...
Cutting values (score) into intervals
to groups of width 10
region_scores %>%
mutate(score_group = cut_width(score, 10, boundary = 0)) %>%
count(score_group)
## score_group n
## 1 [0,10] 55
## 2 (10,20] 154
## 3 (20,30] 31
region_scores <- region_scores %>%
mutate(score_group = cut_width(score, 10, boundary = 0,
labels = c('-10','11-20','21-')))
region_scores %>%
distinct(score_group)
## score_group
## 1 -10
## 2 11-20
## 3 21-
Column score_group is not found.
region_scores2 <- region_scores %>%
group_by(education, score_group, .drop = FALSE) %>%
summarise(mean_age = mean(age),
sd_age = sd(age),
n = n()) %>%
ungroup()
## `summarise()` regrouping output by 'education' (override with `.groups` argument)
region_scores2
## # A tibble: 9 x 5
## education score_group mean_age sd_age n
## <fct> <fct> <dbl> <dbl> <int>
## 1 1 -10 39.5 10.1 46
## 2 1 11-20 38.8 10.2 39
## 3 1 21- NaN NA 0
## 4 2 -10 45 9.27 9
## 5 2 11-20 42.2 9.61 65
## 6 2 21- 39.3 7.57 3
## 7 3 -10 NaN NA 0
## 8 3 11-20 40.1 10.4 50
## 9 3 21- 37.4 8.97 28
Create a figure that shows the distributions (density plots or histograms) of age and score in separate subplots (facets). What do you need to do first?
Note: I’m not sure which grouping variable to use to create the subplots.
In the figure, set individual x-axis limits for age and score by modifying the scales parameter within facet_wrap().
Question: What went wrong when I used facet_wrap() and got the message ‘Layer 1 is missing score_group’ (or another grouping variable)? I met the same problem last week, even though I had saved score_group.
region_scores %>%
ggplot(aes(age, fill = score_group)) +
geom_histogram(position = "identity", alpha = .5, binwidth = 1)
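One way to get age and score into separate facets is to reshape the data to long format first, so that the variable name becomes a column that facet_wrap() can use. The sketch below is on made-up data (toy). The faceting variable has to be a column of the data frame that is passed to ggplot(), which may also be the reason for the ‘Layer 1 is missing’ message.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

toy <- tibble(
  id    = 1:6,
  age   = c(24, 35, 41, 48, 56, 60),
  score = c(4.3, 5.6, 6.9, 7.1, 12.4, 21.0)
)

# Long format: one row per participant and variable.
toy_long <- toy %>%
  pivot_longer(c(age, score), names_to = "var", values_to = "value")

p <- ggplot(toy_long, aes(value)) +
  geom_histogram(bins = 5) +
  facet_wrap(~var, scales = "free_x")  # each facet gets its own x-axis limits
p
```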
(More attempts, kept as a reminder for the future.)
region_scores %>%
ggplot(aes(age, fill = gender)) +
geom_histogram(position = "identity", alpha = .5, binwidth = 1)
region_scores %>%
ggplot(aes(score, fill = gender)) +
geom_histogram(position = "identity", alpha = .5, binwidth = 1)
Note: I do not understand the meaning of the y-axis in such density plots.
region_scores %>%
ggplot(aes(age, fill = gender)) +
geom_density(alpha = .5)
region_scores %>%
ggplot(aes(score, fill = gender)) +
geom_density(alpha = .5)
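On the y-axis question: a density curve is scaled so that the area under it equals 1, so the y-value is not a count. The area between two x-values is the estimated share of observations in that range. The sketch below checks this numerically on simulated ages, using base R's density() rather than ggplot2.

```r
set.seed(42)
age <- rnorm(200, mean = 42, sd = 10)  # simulated ages

d <- density(age)  # kernel density estimate on a grid of x-values

# Trapezoid rule: sum of (step width * average height of neighbouring points).
step <- diff(d$x)
area <- sum(step * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```

The result should be very close to 1, whatever the unit of the x-axis is.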
In this exercise, you will use the built-in iris dataset.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Make the data into long format: gather all variables except species into new variables var (variable names) and measure (numerical values). You should end up with 600 rows and 3 columns (Species, var, and measure). Assign the result into iris_long.
iris_long <- iris %>%
gather(var, measure, -Species)
str(iris_long)
## 'data.frame': 600 obs. of 3 variables:
## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var : chr "Sepal.Length" "Sepal.Length" "Sepal.Length" "Sepal.Length" ...
## $ measure: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
In iris_long, separate var into two variables: part (Sepal/Petal values) and dim (Length/Width).
Then, spread the measurement values to new columns that get their names from dim. You must create row numbers by dim group before doing this.
You should now have 300 rows of variables Species, part, Length and Width (and row numbers). Assign the result into iris_wide.
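Note: This was more complex than the example, and I failed many times. The chunk below is the completed version: var is separated into part and dim, and the row numbers are created by dim group.
iris_wide <- iris_long %>%
separate(var, into = c("part", "dim")) %>% # "Sepal.Length" becomes part = "Sepal", dim = "Length"
group_by(dim) %>%
mutate(row = row_number()) %>% # row number within each dim, so that spread() can pair the values
ungroup() %>%
spread(dim, measure) %>%
select(-row)
An earlier attempt stopped with this error:
Must extract column with a single valid subscript. x Subscript `var` has the wrong type `data.frame<Sepal.Width:double>`. i It must be numeric or character.
Using pivot_wider() on var directly does not work either: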
iris_long %>%
pivot_wider(names_from = c(var),
values_from = measure)
## Warning: Values are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = length` to identify where the duplicates arise
## * Use `values_fn = {summary_fun}` to summarise duplicates
## # A tibble: 3 x 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <list> <list> <list> <list>
## 1 setosa <dbl [50]> <dbl [50]> <dbl [50]> <dbl [50]>
## 2 versicolor <dbl [50]> <dbl [50]> <dbl [50]> <dbl [50]>
## 3 virginica <dbl [50]> <dbl [50]> <dbl [50]> <dbl [50]>
This still does not give the required result: the values are not uniquely identified, so every cell is a list-column.
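The warning appears because nothing tells pivot_wider() which values belong to the same flower. A sketch of one way around it with the newer tidyr functions: add an id column before reshaping. The names flower and iris_wide_alt are illustrative, and this is an alternative to the gather()/spread() approach of the exercise.

```r
library(dplyr)
library(tidyr)

iris_wide_alt <- iris %>%
  mutate(flower = row_number()) %>%          # unique id for each flower
  pivot_longer(-c(Species, flower),
               names_to = c("part", "dim"),  # "Sepal.Length" -> part, dim
               names_sep = "\\.") %>%
  pivot_wider(names_from = dim, values_from = value)

dim(iris_wide_alt)
```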
Using iris_wide, plot a scatter plot of length on the x-axis and width on the y-axis. Colour the points by part.
iris_wide %>%
ggplot(aes(Length, Width, color = part)) + # x = Length, y = Width; color must be inside aes()
geom_point()
Import your data into R. Check that you have the correct number of rows and columns, column names are in place, the encoding of characters looks OK, etc.
learning2019_w4 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_week4.csv", stringsAsFactors = TRUE)
Print the structure/glimpse/summary of the data. Outline briefly what kind of variables you have and if there are any missing or abnormal values. Make sure that each variable has the right class (numeric/character/factor etc).
learning_w4 <- learning2019_w4 %>%
mutate(studylength = as.numeric(studylength),
writingcourse = as.numeric(writingcourse))
str(learning_w4)
## 'data.frame': 206 obs. of 10 variables:
## $ 锘縞luster : int 3 2 1 1 3 1 2 2 1 3 ...
## $ unref : num 4 2 3 2 3 2.67 1 2.33 3 3.67 ...
## $ deep : num 3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
## $ orga : num 3.33 3 4.33 3.67 2.67 4 2.33 3.33 4 3.67 ...
## $ blocks : num 3.33 3.67 3.67 3 3.67 4 2.67 2.33 3.33 2.67 ...
## $ procrastination: num 3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
## $ gender : int 2 2 2 2 2 2 2 2 2 2 ...
## $ studentstatus : int 1 1 1 1 1 1 1 1 1 2 ...
## $ studylength : num 39 51 3 3 15 3 3 3 3 3 ...
## $ writingcourse : num 2 3 4 0 0 11 0 0 44 35 ...
Pick a few (2-5) variables of interest from your data (ideally, both categorical and numerical).
For categorical variables, count the observations in each category (or combination of categories). Are the frequencies balanced?
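learning_w4 %>%
count(cluster, gender) %>%
arrange(desc(n)) %>%
arrange(cluster)
Error: Must group by variables found in .data. * Column cluster is not found. The name of the first column was read with stray encoding characters in front of it (it prints as 锘縞luster above), so cluster does not match it. learning19_w4[1] is not found either, because the data frame is called learning_w4. Well… I’m not very angry.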
For numerical variables, compute some summary statistics (e.g. min, max, mean, median, SD) over the whole dataset or for subgroups. What can you say about the distributions of these variables, or possible group-wise differences?
Overall:
summary(learning_w4)
## 锘縞luster unref deep orga
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.670 1st Qu.:3.750 1st Qu.:2.670
## Median :2.000 Median :2.000 Median :4.000 Median :3.330
## Mean :1.718 Mean :2.178 Mean :4.007 Mean :3.411
## 3rd Qu.:2.000 3rd Qu.:2.670 3rd Qu.:4.500 3rd Qu.:4.000
## Max. :3.000 Max. :5.000 Max. :5.000 Max. :5.000
## blocks procrastination gender studentstatus
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.500 1st Qu.:1.000 1st Qu.:2.000
## Median :2.670 Median :3.250 Median :2.000 Median :2.000
## Mean :2.655 Mean :3.212 Mean :1.714 Mean :1.767
## 3rd Qu.:3.330 3rd Qu.:3.750 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :2.000 Max. :2.000
## studylength writingcourse
## Min. : 2.00 Min. : 0.000
## 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 14.00 Median : 3.000
## Mean : 19.75 Mean : 6.694
## 3rd Qu.: 28.00 3rd Qu.: 6.000
## Max. :172.00 Max. :91.000
For subgroups:
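Note: I do not believe that the means of subgroups divided by gender or student status (Bachelor/Master) could be equal. What’s wrong? Inside summarise(), learning_w4$unref refers to the whole column and ignores the grouping, so every group gets the overall mean; orga_mean is also computed from deep instead of orga.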
grouped_df <- learning_w4 %>%
group_by(studentstatus)
grouped_df %>%
summarise(unref_mean = mean(learning_w4$unref), deep_mean = mean(learning_w4$deep), orga_mean = mean(learning_w4$deep))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
## studentstatus unref_mean deep_mean orga_mean
## <int> <dbl> <dbl> <dbl>
## 1 1 2.18 4.01 4.01
## 2 2 2.18 4.01 4.01
We can see that studylength (how many months students have studied at the university) is a better grouping variable than writingcourse (the number of writing courses). But …
Try cluster (student profile based on the combination of scores on ‘unref’, ‘deep’ and ‘orga’)
learning_w4 %>%
count(learning_w4[1])
## 锘縞luster n
## 1 1 94
## 2 2 76
## 3 3 36
grouped_learning <- learning_w4 %>%
group_by(learning_w4[1])
grouped_learning %>%
summarise(unref_mean = mean(grouped_learning$unref), deep_mean = mean(grouped_learning$deep), orga_mean = mean(grouped_learning$orga))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
## 锘縞luster unref_mean deep_mean orga_mean
## <int> <dbl> <dbl> <dbl>
## 1 1 2.18 4.01 3.41
## 2 2 2.18 4.01 3.41
## 3 3 2.18 4.01 3.41
# the results look strange because grouped_learning$unref takes the whole column and ignores the groups; mean(unref) would give the mean of each cluster
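The identical means come from using $ inside summarise(). A sketch on made-up data (toy) that shows the difference between the two forms:

```r
library(dplyr)

toy <- tibble(
  cluster = c(1, 1, 2, 2),
  unref   = c(1, 2, 4, 5)
)

# toy$unref is the full column, so both groups get the overall mean (3).
wrong <- toy %>%
  group_by(cluster) %>%
  summarise(unref_mean = mean(toy$unref))

# The bare column name is evaluated within each group (1.5 and 4.5).
right <- toy %>%
  group_by(cluster) %>%
  summarise(unref_mean = mean(unref))
```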
Describe if there’s anything else you think should be done as “pre-processing” steps (e.g. recoding/grouping values, renaming variables, removing variables or mutating new ones, reshaping the data to long format, merging data frames together).
Do you have an idea of what kind of relationships in your data you would like to visualise and for which variables? For example, would you like to depict variable distributions, the structure of multilevel data, summary statistics (e.g. means), or include model fits or predictions?
Structure of the data
learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_w4.csv", stringsAsFactors = TRUE)
learning19 <- learning2019[1:13]
str(learning19)
## 'data.frame': 211 obs. of 13 variables:
## $ 锘縞luster : int 3 2 1 1 3 1 2 2 1 3 ...
## $ unref : num 4 2 3 2 3 ...
## $ deep : num 3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
## $ orga : num 3.33 3 4.33 3.67 2.67 ...
## $ blocks : num 3.33 3.67 3.67 3 3.67 ...
## $ procrastination: num 3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
## $ perfectionism : num 3.67 3.33 3.33 2.67 2.33 ...
## $ innateability : num 1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
## $ ktransforming : num 4 3.67 3.67 3.33 4 ...
## $ productivity : num 1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
## $ gender : int 2 2 2 2 2 2 2 2 2 2 ...
## $ studentstatus : int 1 1 1 1 1 1 1 1 1 3 ...
## $ studylength : int 39 51 3 3 15 3 3 3 3 3 ...
summary(learning19)
## 锘縞luster unref deep orga
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.667 1st Qu.:3.750 1st Qu.:2.667
## Median :2.000 Median :2.000 Median :4.000 Median :3.333
## Mean :1.716 Mean :2.171 Mean :4.007 Mean :3.414
## 3rd Qu.:2.000 3rd Qu.:2.667 3rd Qu.:4.500 3rd Qu.:4.000
## Max. :3.000 Max. :5.000 Max. :5.000 Max. :5.000
## blocks procrastination perfectionism innateability
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.500 1st Qu.:2.000 1st Qu.:1.000
## Median :2.667 Median :3.250 Median :2.333 Median :1.500
## Mean :2.662 Mean :3.219 Mean :2.556 Mean :1.761
## 3rd Qu.:3.333 3rd Qu.:3.875 3rd Qu.:3.333 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## ktransforming productivity gender studentstatus
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.667 1st Qu.:1.875 1st Qu.:1.000 1st Qu.:3.000
## Median :4.000 Median :2.500 Median :2.000 Median :4.000
## Mean :4.041 Mean :2.487 Mean :1.716 Mean :3.185
## 3rd Qu.:4.667 3rd Qu.:3.250 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :5.000 Max. :4.750 Max. :2.000 Max. :4.000
## studylength
## Min. : 2.00
## 1st Qu.: 5.00
## Median : 14.00
## Mean : 21.63
## 3rd Qu.: 28.50
## Max. :172.00
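The first column name is printed as `锘縞luster` in the output above. This is consistent with a UTF-8 byte order mark (BOM) at the start of the CSV file being decoded in a different encoding. The following is a minimal sketch of one possible fix, assuming the file really was saved as UTF-8 with a BOM (the original file was not checked). The temporary file and the toy values are hypothetical and only stand in for learning2019.csv.

```r
# Hypothetical toy file that starts with a UTF-8 byte order mark (BOM)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)             # the three BOM bytes
writeBin(charToRaw("cluster,unref\n3,4\n2,2\n"), con)  # toy rows, not the real data
close(con)

# fileEncoding = "UTF-8-BOM" asks read.csv() to drop the BOM while reading
toy <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = TRUE)
names(toy)
```

If re-reading the file is not convenient, renaming the column afterwards, for example with `names(learning2019)[1] <- "cluster"`, should have the same effect.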
Calculate and print the correlation matrix
cor_matrix<-cor(learning19[2:10]) %>% round(digits = 2)
cor_matrix
## unref deep orga blocks procrastination perfectionism
## unref 1.00 -0.48 -0.31 0.33 0.25 0.28
## deep -0.48 1.00 0.32 -0.27 -0.18 -0.19
## orga -0.31 0.32 1.00 -0.22 -0.38 -0.14
## blocks 0.33 -0.27 -0.22 1.00 0.55 0.54
## procrastination 0.25 -0.18 -0.38 0.55 1.00 0.35
## perfectionism 0.28 -0.19 -0.14 0.54 0.35 1.00
## innateability 0.16 -0.11 -0.02 0.24 0.13 0.28
## ktransforming -0.16 0.31 0.16 -0.30 -0.21 -0.25
## productivity -0.15 0.16 0.30 -0.38 -0.46 -0.22
## innateability ktransforming productivity
## unref 0.16 -0.16 -0.15
## deep -0.11 0.31 0.16
## orga -0.02 0.16 0.30
## blocks 0.24 -0.30 -0.38
## procrastination 0.13 -0.21 -0.46
## perfectionism 0.28 -0.25 -0.22
## innateability 1.00 -0.25 0.01
## ktransforming -0.25 1.00 0.21
## productivity 0.01 0.21 1.00
Visualize the correlation matrix, marking the correlations according to their significance level
library(corrplot)
## corrplot 0.84 loaded
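# cor.mtest() tests the correlations, so it needs the raw data rather than the rounded correlation matrix
p.mat <- cor.mtest(learning19[2:10])$p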
corrplot(cor_matrix, method="circle", type="upper", tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, title="Correlations of learning19", mar=c(0,0,1,0))
learning19 %>%
ggplot(aes(orga, procrastination, color = cluster)) + # x = orga, y = procrastination
geom_point()
Euclidean distance matrix
learning19_eu <- dist(learning19[2:4])
summary(learning19_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.083 1.601 1.741 2.192 6.741
set.seed(123)
k_max <- 5 # determine the number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(learning19[2:4], k)$tot.withinss}) # calculate the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualize the results
The total within-cluster sum of squares (twcss) decreases sharply over the range of 2 to 5 clusters. Based on this plot, the optimal number of clusters was 3.
learning19_km <- kmeans(learning19[2:10], centers = 3)
Plot the dataset with clusters
pairs(learning19[2:10], col = learning19_km$cluster)
pairs(learning19[,2:4], col = learning19_km$cluster)
pairs(learning19[,5:10], col = learning19_km$cluster)
With three clusters we got the best overview of the data.
library(devtools)
library(flipMultivariates)
learning19_scaled3 <- scale(learning19[2:4])
learning19_km3 <-kmeans(learning19_scaled3, centers = 3)
cluster <- learning19_km3$cluster
learning19_scaled3 <- data.frame(learning19_scaled3, cluster)
lda.fit_cluster <- lda(cluster ~ ., data = learning19_scaled3)
lda.fit_cluster
Warning in install.packages : package ‘flipMultivariates’ is not available
but the code used to run, so I kept it.
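As far as I know, the `lda()` function used above is provided by the MASS package, so the LDA step may not need flipMultivariates at all. The following is a minimal sketch under that assumption. The built-in iris data is used as a hypothetical stand-in for `learning19[2:4]`, and the `toy_` names are illustrative only.

```r
library(MASS)  # lda() comes from MASS

set.seed(123)
toy_scaled <- scale(iris[, 1:3])                        # stand-in for learning19[2:4]
toy_km <- kmeans(toy_scaled, centers = 3, nstart = 25)  # three clusters, as above
toy_df <- data.frame(toy_scaled, cluster = factor(toy_km$cluster))

# Linear discriminant analysis with the k-means cluster as the target
toy_lda <- lda(cluster ~ ., data = toy_df)
toy_lda$scaling  # coefficients of the linear discriminants (LD1, LD2)
```

Note that loading MASS after dplyr masks `select()`, which is one reason to write `dplyr::select()` explicitly.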
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes3 <- as.numeric(learning19_scaled3$cluster)
plot(lda.fit_cluster, dimen = 2, col = classes3, pch = classes3, main = "LDA biplot using three clusters")
lda.arrows(lda.fit_cluster, myscale = 2)
model_predictors <- dplyr::select(learning19_train, -deep2)
# check the dimensions
dim(model_predictors)
dim(lda.fit$scaling)
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
Next, install and access the plotly package.
Create a 3D plot of the columns of the matrix product.
library(plotly)
plot_ly (x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = learning19_train$deep2)
library(plot3D)
scatter3D(x = learning19$unref, y = learning19$deep, z = learning19$orga, col = NULL,
main = "learning19 data", xlab = "unref", # axis labels now match the plotted variables
ylab = "deep", zlab = "orga")
library(plotly)
plot_ly (x = learning19$unref, y = learning19$deep, z = learning19$orga, type= 'scatter3d', mode='markers', color = learning19$deep)
yufan_yin_week5: 13.10. - 19.10.2020
Read the file timeuse_tidy.rds with readRDS(). The file contains the dataset that we tidied in the exercise session: records of daily time use from participants over multiple days. Note that since the data has been stored as rds (R-specific format), column types and factor levels are as we left them, and don’t need to be re-corrected.
readRDS(file = "D:/Users/yinyf/datavis-R/week5/timeuse_tidy.rds")
## # A tibble: 26,568 x 9
## indivID date female age occ_full_time activity_class time_spent
## <chr> <date> <fct> <dbl> <fct> <fct> <dbl>
## 1 1013302 2016-11-11 0 70 0 Lifts 0
## 2 1013302 2016-11-11 0 70 0 Work 0
## 3 1013302 2016-11-11 0 70 0 Education 0
## 4 1013302 2016-11-11 0 70 0 Shopping 0
## 5 1013302 2016-11-11 0 70 0 Business 0
## 6 1013302 2016-11-11 0 70 0 Petrol 0
## 7 1013302 2016-11-11 0 70 0 Social / Leis~ 0
## 8 1013302 2016-11-11 0 70 0 Vacation 0
## 9 1013302 2016-11-11 0 70 0 Exercise 6
## 10 1013302 2016-11-11 0 70 0 Home 1424
## # ... with 26,558 more rows, and 2 more variables: weekday <ord>,
## # week_number <dbl>
df <- readRDS(file = "D:/Users/yinyf/datavis-R/week5/timeuse_tidy.rds")
summary(df)
## indivID date female age
## Length:26568 Min. :2016-10-14 0:12024 Min. :21.00
## Class :character 1st Qu.:2016-11-14 1:14544 1st Qu.:34.50
## Mode :character Median :2016-12-02 Median :34.50
## Mean :2016-11-26 Mean :40.51
## 3rd Qu.:2016-12-11 3rd Qu.:54.50
## Max. :2016-12-27 Max. :80.00
##
## occ_full_time activity_class time_spent weekday week_number
## 0: 8460 Business : 2214 Min. : 0 Mon:3828 Min. :42.00
## 1:18108 Education: 2214 1st Qu.: 0 Tue:3912 1st Qu.:46.00
## Exercise : 2214 Median : 0 Wed:3708 Median :49.00
## Home : 2214 Mean : 120 Thu:3276 Mean :47.83
## Lifts : 2214 3rd Qu.: 19 Fri:3528 3rd Qu.:50.00
## Petrol : 2214 Max. :1440 Sat:3876 Max. :52.00
## (Other) :13284 Sun:4440
Create a new variable that contains combined activity classes: “Work or school” (Work, Business, Education), “Free time” (Shopping, Social / Leisure, Home, Vacation), and “Other”.
df <- df %>%
mutate(activity_class = as.character(activity_class))
df_wide <- df %>%
group_by(activity_class) %>%
mutate(row = row_number()) %>%
ungroup %>%
spread(activity_class, time_spent) %>%
select(-row) #long to wide
head(df_wide)
## # A tibble: 6 x 19
## indivID date female age occ_full_time weekday week_number Business
## <chr> <date> <fct> <dbl> <fct> <ord> <dbl> <dbl>
## 1 1013302 2016-11-11 0 70 0 Fri 46 0
## 2 1013302 2016-11-12 0 70 0 Sat 46 0
## 3 1013302 2016-11-13 0 70 0 Sun 46 0
## 4 1013302 2016-11-14 0 70 0 Mon 46 0
## 5 1013302 2016-11-15 0 70 0 Tue 46 0
## 6 1013302 2016-11-16 0 70 0 Wed 46 0
## # ... with 11 more variables: Education <dbl>, Exercise <dbl>, Home <dbl>,
## # Lifts <dbl>, `Non-Allocated` <dbl>, Petrol <dbl>, Shopping <dbl>, `Social /
## # Leisure` <dbl>, Travel <dbl>, Vacation <dbl>, Work <dbl>
df_long1 <- df_wide %>%
gather(Free_time, value2, `Shopping`, `Social / Leisure`, `Home`, `Vacation`) # wide to long; I did not know a more concise way, so I had to do this three times
df_long2 <- df_long1 %>%
gather(Work_or_school, value1, Work, Business, Education)
df2 <- df_long2 %>%
gather(Other, value3, Exercise:Travel)
head(df2) # the final result should have two columns ('activity_class' and 'time_spent'). Maybe I should rename the columns or values and then convert wide to long once or twice. However, I could not figure it out.
## # A tibble: 6 x 13
## indivID date female age occ_full_time weekday week_number Free_time
## <chr> <date> <fct> <dbl> <fct> <ord> <dbl> <chr>
## 1 1013302 2016-11-11 0 70 0 Fri 46 Shopping
## 2 1013302 2016-11-12 0 70 0 Sat 46 Shopping
## 3 1013302 2016-11-13 0 70 0 Sun 46 Shopping
## 4 1013302 2016-11-14 0 70 0 Mon 46 Shopping
## 5 1013302 2016-11-15 0 70 0 Tue 46 Shopping
## 6 1013302 2016-11-16 0 70 0 Wed 46 Shopping
## # ... with 5 more variables: value2 <dbl>, Work_or_school <chr>, value1 <dbl>,
## # Other <chr>, value3 <dbl>
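One possibly more concise route to the combined classes is to recode `activity_class` directly in the long data, without going wide and back. The following is a sketch only: the toy rows are hypothetical and merely mimic the layout of timeuse_tidy.rds, and `activity_class2` and `mean_time` are illustrative names. It uses `dplyr::case_when()` for the recoding and then computes means by weekday, participant ID and `occ_full_time`.

```r
library(dplyr)

# Hypothetical toy rows in the same long layout as timeuse_tidy.rds
toy <- tibble(
  indivID = c("a", "a", "a", "b", "b", "b"),
  weekday = c("Mon", "Mon", "Mon", "Tue", "Tue", "Tue"),
  occ_full_time = c("1", "1", "1", "0", "0", "0"),
  activity_class = c("Work", "Home", "Exercise", "Education", "Shopping", "Travel"),
  time_spent = c(480, 600, 30, 300, 60, 45)
)

# Recode the detailed classes into three combined classes
toy_combined <- toy %>%
  mutate(activity_class2 = case_when(
    activity_class %in% c("Work", "Business", "Education") ~ "Work or school",
    activity_class %in% c("Shopping", "Social / Leisure", "Home", "Vacation") ~ "Free time",
    TRUE ~ "Other"
  ))

# Mean time per combined class, grouped by weekday, participant and occ_full_time
toy_means <- toy_combined %>%
  group_by(weekday, indivID, occ_full_time, activity_class2) %>%
  summarise(mean_time = mean(time_spent), .groups = "drop")
toy_means
```

This averages the individual records. If a daily total per combined class is wanted instead, the time could first be summed per participant, date and combined class before taking the mean.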
Calculate the mean time spent on each of the combined activity classes, grouped by weekday, participant ID, and occ_full_time.
grouped_df2 <- df2 %>%
group_by(weekday)
grouped_df2 %>%
summarise(Work_or_school_mean = mean(value1), Free_time_mean = mean(value2), Other_mean = mean(value3))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 4
## weekday Work_or_school_mean Free_time_mean Other_mean
## <ord> <dbl> <dbl> <dbl>
## 1 Mon 87.2 253. 33.6
## 2 Tue 98.6 246. 32.3
## 3 Wed 106. 233. 38.6
## 4 Thu 104. 236. 36.9
## 5 Fri 88.4 249. 35.8
## 6 Sat 18.9 304. 33.5
## 7 Sun 15.0 312. 29.3
grouped_df2 <- df2 %>%
group_by(indivID)
grouped_df2 %>%
summarise(Work_or_school_mean = mean(value1), Free_time_mean = mean(value2), Other_mean = mean(value3))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 356 x 4
## indivID Work_or_school_mean Free_time_mean Other_mean
## <chr> <dbl> <dbl> <dbl>
## 1 1013302 28.1 326. 10.7
## 2 1056237 0 359. 0.943
## 3 1103940 42.1 310. 14.9
## 4 118068 95.5 256. 25.7
## 5 1198262 82.9 282. 12.6
## 6 1202035 87.9 204. 72.2
## 7 121881 0.238 323. 29.7
## 8 1226238 68 264. 35.8
## 9 1292043 87.6 268. 20.6
## 10 1326897 54.8 267. 41.6
## # ... with 346 more rows
grouped_df2 <- df2 %>%
group_by(occ_full_time)
grouped_df2 %>%
summarise(Work_or_school_mean = mean(value1), Free_time_mean = mean(value2), Other_mean = mean(value3))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
## occ_full_time Work_or_school_mean Free_time_mean Other_mean
## <fct> <dbl> <dbl> <dbl>
## 1 0 47.2 287. 30.2
## 2 1 83.0 253. 35.9
Visualise the means you calculated.
If I had got the right results in 1.1, the code here would be:
```r
# chunk options: fig.width = 10, fig.height = 8
df2 %>%
  ggplot(aes(activity_class2, time_spent, group = weekday, colour = weekday)) +
  geom_point() +
  facet_wrap(~activity_class2) +
  labs(x = "Activity type", y = "Average time spent (minutes)", colour = "Weekday")
df2 %>%
  ggplot(aes(activity_class2, time_spent, group = indivID, colour = indivID)) +
  geom_point() +
  facet_wrap(~activity_class2) +
  labs(x = "Activity type", y = "Average time spent (minutes)", colour = "Participant")
df2 %>%
  ggplot(aes(activity_class2, time_spent, group = occ_full_time, colour = occ_full_time)) +
  geom_point() +
  facet_wrap(~activity_class2) +
  labs(x = "Activity type", y = "Average time spent (minutes)", colour = "Full-time occupation")
df2 %>%
  ggplot(aes(weekday, time_spent, group = week_number, color = activity_class)) +
  geom_line(size = 1, alpha = .1) +
  geom_point(alpha = .6) +
  facet_wrap(~activity_class, scales = "free_y") +
  labs(x = "Weekday", y = "Average time spent (minutes)", color = "Activity type") +
  theme_bw() +
  theme(legend.position = "none")
```
For now I can only use 'Work_or_school' as an example:
```r
df2 %>%
  ggplot(aes(weekday, value1, group = week_number, color = Work_or_school)) +
  geom_line(size = 1, alpha = .1) +
  geom_point(alpha = .6) +
  facet_wrap(~Work_or_school, scales = "free_y") +
  labs(x = "Weekday", y = "Average time spent (minutes)", colour = "Activity type") +
  theme_bw() +
  theme(legend.position = "none")
```
What is computed in the code chunk below - what do the numbers tell you?
Can you think of another way to calculate the same thing?
df2 %>%
distinct(indivID, date) %>%
arrange(date) %>%
count(date)
## # A tibble: 73 x 2
## date n
## <date> <int>
## 1 2016-10-14 6
## 2 2016-10-15 11
## 3 2016-10-16 10
## 4 2016-10-17 10
## 5 2016-10-18 14
## 6 2016-10-19 16
## 7 2016-10-20 11
## 8 2016-10-21 11
## 9 2016-10-22 17
## 10 2016-10-23 18
## # ... with 63 more rows
grouped_df2 <- df2 %>%
group_by(date)
grouped_df2 %>%
summarise(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 73 x 2
## date n
## <date> <int>
## 1 2016-10-14 360
## 2 2016-10-15 660
## 3 2016-10-16 600
## 4 2016-10-17 600
## 5 2016-10-18 840
## 6 2016-10-19 960
## 7 2016-10-20 660
## 8 2016-10-21 660
## 9 2016-10-22 1020
## 10 2016-10-23 1080
## # ... with 63 more rows
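The two outputs above differ: the first counts the distinct participants who recorded data on each date, while `summarise(n = n())` counts rows, and every count in the second output is 60 times the first. The following is a sketch of an alternative that should reproduce the first result, using `n_distinct()`. The toy data is hypothetical.

```r
library(dplyr)

# Hypothetical toy data: several rows per participant and date
toy <- tibble(
  indivID = c("a", "a", "a", "b", "b", "c"),
  date = as.Date(c("2016-10-14", "2016-10-14", "2016-10-15",
                   "2016-10-14", "2016-10-14", "2016-10-14"))
)

# Original approach: keep one row per participant and date, then count per date
per_date <- toy %>%
  distinct(indivID, date) %>%
  count(date)

# Alternative: count the distinct participants within each date
per_date_alt <- toy %>%
  group_by(date) %>%
  summarise(n = n_distinct(indivID))
per_date_alt
```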
My exercise ended here. I may continue later.
The direct reason is that the ‘Run’ button disappeared when I worked on the exercises after 2.2, so I could not run any chunk. However, the main reason is that I could not keep up with the pace of the lecture at all last week. I do not know why it was so difficult and abstract to understand. By the way, my own study is cross-sectional and has no time-related variable.
I would like to say something about the course.
The first point is about the approach to teaching. A lack of interaction impairs the quality of teaching.
I do not mean the timely Q&A that you already provided during the lectures. In the online UH MOOC last year (https://mooc.helsinki.fi/course/view.php?id=273&lang=en; it has been run for many rounds) and in the video-based remote R course this semester (which friends are taking), we had time to understand, digest and solve most of the problems, thanks to the interactive applets, short instructional videos (which can be paused at any time) and an active forum where participants helped each other. Considering the size and type of our course, I know that some of these are unrealistic for us. But the impact does exist.
The second point is about the assessment. The weekly grading is a little strict.
It is pass/fail. At the same time, ‘a valiant effort without full completion’ gives half of the points. The criterion is reasonable for a 2-credit course, but this one is a 5-credit, intensive course. In other 5-credit R courses, either every task is graded on a 5-point scale or the criteria are not as harsh.
If only one or two words are wrong (e.g. in week2ex3 and week4ex4, the means were equal between the groups because I used df$variable; the name of one variable starts with a garbled code, and I made the same choice even when the other variables could be referred to directly), the code cannot work at all in practice. But in this course it leads to 1/2 points, the same as if I had written a chunk ending with an incomplete plot or even without any plot. Therefore it is very easy to be on the edge of losing all 5 credits, like me (I got 19/36 points before this week).
I hesitated to write the above. Any course on R is tough and full of errors. I am not sure how many students have similar confusion. Moreover, we are already doctoral students and do not need to value credits too much. I just do not want to give up without a struggle (and my field is teaching and learning in higher education): since the aim of attending courses is to learn something, should a student stop when he or she has heard something but was unable to master it? It feels as if the course and teachers abandon the participants without communication as soon as they cannot keep up with the pace.
Whether for acquiring skills or credits, I hope I can continue attending this course.
Plot the numbers from above (use points, lines, or whatever you think is suitable).
df2 %>%
ggplot(aes(date, indivID)) +
geom_point()
Count the total number of participants in the data.
For each participant, count the number of separate days that they recorded their time use on.
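The two counts asked for here can be sketched as follows. The toy data is hypothetical and `n_days` is an illustrative column name. On the real data, the grouped summary by `indivID` shown earlier has 356 rows, so the total should come out as 356 participants.

```r
library(dplyr)

# Hypothetical toy data: several rows per participant and date
toy <- tibble(
  indivID = c("a", "a", "a", "b", "b", "c"),
  date = as.Date(c("2016-10-14", "2016-10-14", "2016-10-15",
                   "2016-10-14", "2016-10-14", "2016-10-14"))
)

# Total number of participants
n_participants <- n_distinct(toy$indivID)

# Number of separate recording days for each participant
days_per_participant <- toy %>%
  distinct(indivID, date) %>%
  count(indivID, name = "n_days")
days_per_participant
```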
Explain step by step what happens in the code chunk below, and what the final figure represents.
df2 %>%
group_by(indivID) %>%
mutate(start_date = min(date)) %>%
ungroup %>%
mutate(indivID = factor(indivID),
indivID = fct_reorder(indivID, start_date) %>% fct_rev()) %>%
ggplot(aes(date, indivID, colour = month(start_date, label = T))) +
geom_line() +
geom_point(size=.5, alpha=.1) +
theme_bw() +
scale_y_discrete(breaks = "none") +
labs(x = "Date", y = "", colour = "Starting month")
yufan_yin_week6: 20.10. - 27.10.2020
Also see in the page to my course diary: https://yufanyin.github.io/datavis-R/
The data frames df_w and df_f represent repeated measures data from 60 participants. Variables F1-F3 and W1-W3 are “sub-variables” that will be used to make two composite variables F_total and W_total, respectively.
Merge the two data frames together.
df_f <- df_f %>%
mutate(session = as.factor(session)) # many errors occurred when I tried to change the type of 'session' and 'group'. Q: I still do not understand why only a factor works.
df <- full_join(df_f, df_w, by = c("id" = "id", "session" = "session", "group" = "group"), suffix = c("_f", "_w"))
head(df)
## id session group F1 F2 F3 W1 W2 W3
## 1 1 2 1 0 0 3 1 0 0
## 2 1 1 1 3 0 0 2 3 2
## 3 2 2 1 2 0 2 3 2 0
## 4 2 1 1 0 0 0 0 2 1
## 5 3 2 1 1 0 0 0 3 3
## 6 3 1 1 0 0 0 3 2 0
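A likely reason why only the factor conversion worked is that dplyr joins need the key columns to have compatible types in both data frames. This assumes `session` was already a factor in `df_w`, which was not checked here. The following is a minimal sketch with hypothetical toy frames (`toy_f` and `toy_w` are not the course data).

```r
library(dplyr)

# Hypothetical toy frames: 'session' is an integer in one and a factor in the other
toy_f <- data.frame(id = c(1, 1, 2), session = c(1L, 2L, 1L), F1 = c(3, 0, 0))
toy_w <- data.frame(id = c(1, 1, 2), session = factor(c(1, 2, 1)), W1 = c(2, 1, 0))

# Inspect the key types before joining
key_types <- c(f = class(toy_f$session), w = class(toy_w$session))
key_types

# Bring the key to the same type in both frames, then join
toy_f <- toy_f %>% mutate(session = as.factor(session))
toy_joined <- full_join(toy_f, toy_w, by = c("id", "session"))
toy_joined
```

Converting the factor in the other frame to an integer instead should work as well, as long as both sides end up with the same type.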
Using the merged data frame, create the composite variables F_total and W_total, which are the sums of F1-F3 and W1-W3, respectively (i.e. their values can range from 0 to 9).
df$F_total <- rowSums(df[, c('F1', 'F2', 'F3')])
df$W_total <- rowSums(df[, c('W1', 'W2', 'W3')])
# I searched all the material and did not find where row or column sums were taught. Why would this be an exercise?
head(df)
## id session group F1 F2 F3 W1 W2 W3 F_total W_total
## 1 1 2 1 0 0 3 1 0 0 3 1
## 2 1 1 1 3 0 0 2 3 2 3 7
## 3 2 2 1 2 0 2 3 2 0 4 5
## 4 2 1 1 0 0 0 0 2 1 0 3
## 5 3 2 1 1 0 0 0 3 3 1 6
## 6 3 1 1 0 0 0 3 2 0 0 5
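The same composite variables can also be built with `mutate()` and plain addition, which may be closer to what the course material covered. This is a sketch only: the toy values copy the first three rows of the merged data frame shown above.

```r
library(dplyr)

# First three rows of the merged data frame shown above
toy <- data.frame(
  F1 = c(0, 3, 2), F2 = c(0, 0, 0), F3 = c(3, 0, 2),
  W1 = c(1, 2, 3), W2 = c(0, 3, 2), W3 = c(0, 2, 0)
)

# Sum the sub-variables column by column
toy <- toy %>%
  mutate(F_total = F1 + F2 + F3,
         W_total = W1 + W2 + W3)
toy
```

Like `rowSums()` with its default `na.rm = FALSE`, this gives `NA` for a row if any of its sub-variables is missing.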
Visualise the distributions of F_total and W_total for the two groups and measurement sessions (for example as boxplots).
df %>%
ggplot(aes(session, F_total)) +
geom_boxplot() +
facet_wrap(~group)
df %>%
ggplot(aes(session, W_total)) +
geom_boxplot() +
facet_wrap(~group)
# try more
df %>%
ggplot(aes(session, F_total)) +
geom_violin() +
geom_dotplot(binaxis = "y", stackdir = "center", alpha = .3, binwidth = .1) +
facet_wrap(~group)
# Q: Is binwidth set without a specific standard ('exploring multiple widths to find the best to illustrate the stories in your data')? I found the dots were too big when binwidth = 1 (as in data_wrangling_and_plotting_week3).
df %>%
ggplot(aes(session, W_total)) +
geom_violin() +
geom_jitter(alpha = .3) +
facet_wrap(~group)
# Q: Is this the original distribution, without any calculation or rotation? If so, I prefer this plot to the next one.
# ['The jitter geom is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.']
df %>%
ggplot(aes(session, W_total)) +
geom_violin() +
geom_dotplot(binaxis = "y", stackdir = "center", alpha = .3, binwidth = .1) +
facet_wrap(~group)
# ['stackdir: which direction to stack the dots. "up" (default), "down", "center", "centerwhole" (centered, but with dots aligned)']
Fit a linear regression model with F_total as the DV, and session and group as predictors.
# Q: where is 'DV'? (DV presumably stands for 'dependent variable', here F_total.)
# LM with an interaction effect
F_total.model.1 <- lm(F_total ~ session * group, data = df)
summary(F_total.model.1)
##
## Call:
## lm(formula = F_total ~ session * group, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3000 -0.8667 -0.3000 0.7000 3.7000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4000 0.2740 5.109 1.28e-06 ***
## session2 0.9000 0.3875 2.323 0.0219 *
## group2 1.9000 0.3875 4.903 3.10e-06 ***
## session2:group2 -1.3333 0.5480 -2.433 0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.501 on 116 degrees of freedom
## Multiple R-squared: 0.1883, Adjusted R-squared: 0.1673
## F-statistic: 8.969 on 3 and 116 DF, p-value: 2.163e-05
F_total_coef <- broom::tidy(F_total.model.1) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 2)) %>% # round decimals for plot text
spread(term, estimate) %>%
rename(Intercept = `(Intercept)`,
group_coef = group,
session_coef = session,
session_group_coef = `session:group`)
error: Can't rename columns that don't exist. x Column `group` doesn't exist.
# I failed to rename the columns above, so this chunk produced an error. I had to delete {r} to retain the code.
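The rename error is probably caused by the factor predictors: in the summary output above the coefficients are called `session2`, `group2` and `session2:group2`, not `session` and `group`. The following is a sketch that checks the coefficient names on hypothetical toy data (`toy` and `toy_model` are not the course data).

```r
# Hypothetical toy data with two two-level factors, like session and group
set.seed(1)
toy <- data.frame(
  session = factor(rep(c(1, 2), each = 20)),
  group   = factor(rep(c(1, 2), times = 20)),
  F_total = rpois(40, lambda = 2)
)
toy_model <- lm(F_total ~ session * group, data = toy)

# Factor predictors get the level appended to the coefficient name
coef_names <- names(coef(toy_model))
coef_names

# In the tidy()/spread() pipeline above, the rename step could then refer to these names:
# rename(Intercept = `(Intercept)`,
#        group_coef = group2,
#        session_coef = session2,
#        session_group_coef = `session2:group2`)
```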
(F_total_plot <- broom::augment(F_total.model.1, se_fit = T) %>%
ggplot(aes(session, F_total)) +
geom_point(aes(color = group), alpha = .7) +
geom_line(aes(session, .fitted, color = group), size = 1) +
geom_ribbon(aes(ymin=.fitted-1.96*.se.fit, ymax=.fitted+1.96*.se.fit, fill = group), alpha=0.2) +
theme_bw())
# plot annotations
F_total_plot +
geom_point(aes(0, F_total_coef$Intercept)) + # mark the intercept point
geom_text(aes(0.35, F_total_coef$Intercept,
label = paste("Intercept =", F_total_coef$Intercept)), vjust=-.9) +
geom_text(aes(4.2, F_total_coef$Intercept + F_total_coef$session_coef * 4.2, # annotate session coefficient
label = paste("Slope =", F_total_coef$session_coef)),
vjust = -.9) +
geom_segment(aes(x = 1.3, y = F_total_coef$Intercept + F_total_coef$session_coef * 1.3, # draw arrow to mark group coefficient
xend = 1.3, yend = F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1),
arrow = arrow()) +
geom_text(aes(1.3, F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1,
label = paste("Group coef =", F_total_coef$group_coef)),
vjust = 2, hjust = 1.1) +
geom_segment(aes(x = 1.3, y = F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1,
xend = 1.3, yend = F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1 + F_total_coef$session_group_coef * 1.3),
arrow = arrow()) +
geom_text(aes(1.3, F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1 + F_total_coef$session_group_coef * 1.3,
label = paste("Interaction coef =", F_total_coef$session_group_coef)),
vjust = 2, hjust = 1.1)
Look at the means of F_total by group and session. How are they linked to the linear regression model coefficients?
grouped_F_total1 <- df %>%
group_by(group)
grouped_F_total1 %>%
summarise(F_total1 = mean(F_total))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## group F_total1
## <fct> <dbl>
## 1 1 1.85
## 2 2 3.08
grouped_F_total2 <- df %>%
group_by(session)
grouped_F_total2 %>%
summarise(F_total2 = mean(F_total))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## session F_total2
## <fct> <dbl>
## 1 1 2.35
## 2 2 2.58
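One way to see the link is through the four cell means. With treatment coding, the intercept is the mean of session 1 in group 1, and the other coefficients are differences from it. The following sketch rebuilds the cell means from the coefficients reported in the summary output above and averages them. It assumes equal cell sizes, which was not verified against the data, although the results agree with the group and session means printed above.

```r
# Coefficients as reported in the summary() output above
b0         <- 1.4      # (Intercept): session 1, group 1
b_session2 <- 0.9      # session2
b_group2   <- 1.9      # group2
b_inter    <- -1.3333  # session2:group2

# Predicted mean of each session-by-group cell
cell <- c(
  s1_g1 = b0,
  s2_g1 = b0 + b_session2,
  s1_g2 = b0 + b_group2,
  s2_g2 = b0 + b_session2 + b_group2 + b_inter
)

# Marginal means, assuming equal cell sizes
group_means   <- c(g1 = mean(cell[c("s1_g1", "s2_g1")]),
                   g2 = mean(cell[c("s1_g2", "s2_g2")]))
session_means <- c(s1 = mean(cell[c("s1_g1", "s1_g2")]),
                   s2 = mean(cell[c("s2_g1", "s2_g2")]))
round(group_means, 2)    # compare with 1.85 and 3.08 above
round(session_means, 2)  # compare with 2.35 and 2.58 above
```

On the real data, the cell means themselves could be obtained by grouping by both `group` and `session` before calling `summarise()`.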
Visualise the anscombe dataset using ggplot2.
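A possible sketch for this task: the built-in `anscombe` data stores the four sets side by side (`x1`-`x4`, `y1`-`y4`), so it is first stacked into a long format and then plotted with one facet per set. The names `anscombe_long`, `set` and `anscombe_plot` are illustrative.

```r
library(ggplot2)

# anscombe ships with base R (datasets package): columns x1..x4 and y1..y4
anscombe_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    set = paste("Set", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# One panel per set, each with points and a fitted regression line
anscombe_plot <- ggplot(anscombe_long, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~set)
anscombe_plot
```

The four panels should show nearly identical regression lines over very different point patterns, which is the point of the dataset.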
Prepare a separate R Notebook/Markdown document, which will be the first draft of your final assignment with your own data. In the draft, include the following:
Even if you had already completed some of these steps before, please include all of them in your document. NOTE: Return either a readable HTML document (.html or .nb.html), or an .Rmd file along with your data, to make it possible for us to review your work! Make the document as professional-looking as possible (you can, of course, include your comments/questions in the draft). You will get feedback on the draft, based on which you can then make the final version. The final document should be a comprehensive report of your data wrangling process and results.
# Sorry, I recently have another urgent task for my article and do not have enough time to do this part.